Dealing with missing values in large-scale studies: microarray data imputation and beyond

نویسنده

  • Tero Aittokallio
چکیده

High-throughput biotechnologies, such as gene expression microarrays or mass-spectrometry-based proteomic assays, suffer from frequent missing values due to various experimental reasons. Since the missing data points can hinder downstream analyses, there exists a wide variety of ways in which to deal with missing values in large-scale data sets. Nowadays, it has become routine to estimate (or impute) the missing values prior to the actual data analysis. After nearly a decade since the publication of the first missing value imputation methods for gene expression microarray data, new imputation approaches are still being developed at an increasing rate. However, what is lagging behind is a systematic and objective evaluation of the strengths and weaknesses of the different approaches when faced with different types of data sets and experimental questions. In this review, the present strategies for missing value imputation and the measures for evaluating their performance are described. The imputation methods are first reviewed in the context of gene expression microarray data, since most of the methods have been developed for estimating gene expression levels; then, we turn to other large-scale data sets that also suffer from the problems posed by missing values, together with pointers to possible imputation approaches in these settings. Along with a description of the basic principles behind the different imputation approaches, the review tries to provide practical guidance for the users of high-throughput technologies on how to choose the imputation tool for their data and questions, and some additional research directions for the developers of imputation methodologies.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

چند رویکرد برخورد با مقادیر گمشده‌ متغیرهای کمی و بررسی اثر آنها بر نتایج حاصل از یک کارآزمایی‌ بالینی

Background and Objectives: A major challenge that affects the longitudinal studies is the problem of missing data. Missing in the data may result in the loss of part of the information which reduces the accuracy of the estimator and obtain the results will be biased and inaccurate. Therefore, it is necessary to evaluate the missing data mechanism from a longitudinal research and to consider thi...

متن کامل

Missing Value Estimation of Epistatic Miniarray Profiling Data by Kernel Pca Regression Ensemble Approach

Missing data imputation is a key issue in learning from incomplete data. Various techniques have been developed with great success on dealing with missing values in data sets with heterogeneous attributes (their independent attributes are of different types) referred to as imputing mixed-attribute data sets. Epistatic miniarray profiling (E-MAP) is a powerful tool for analyzing gene functions a...

متن کامل

تحلیل درستنمایی ماکزیمم مدل رگرسیون لجستیک در حالتی که داده های متغیرهای پیشگو کامل نیستند ولی متغیرهای کمکی وجود دارند

Background and Objectives: Missing data exist in many studies, e.g. in regression models, and they decrease the model's efficacy. Many methods have been suggested for handling incomplete data: they have generally focused on missing outcome values. But covariate values can also be missing.Materials and Methods: In this paper we study the missing imputation by the EM algorithm and auxiliary varia...

متن کامل

Influence of Pattern of Missing Data on Performance of Imputation Methods: An Example from National Data on Drug Injection in Prisons

Background Policy makers need models to be able to detect groups at high risk of HIV infection. Incomplete records and dirty data are frequently seen in national data sets. Presence of missing data challenges the practice of model development. Several studies suggested that performance of imputation methods is acceptable when missing rate is moderate. One of the issues which was of less concern...

متن کامل

Gaussian mixture clustering and imputation of microarray data

MOTIVATION In microarray experiments, missing entries arise from blemishes on the chips. In large-scale studies, virtually every chip contains some missing entries and more than 90% of the genes are affected. Many analysis methods require a full set of data. Either those genes with missing entries are excluded, or the missing entries are filled with estimates prior to the analyses. This study c...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:
  • Briefings in bioinformatics

دوره 11 2  شماره 

صفحات  -

تاریخ انتشار 2010